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Abstract 

A generic algorithm for the extraction of probabilistic (Bayesian) information about 
model parameters from data is presented. The algorithm propagates an ensemble of 
particles in the product space of model parameters and outputs. Each particle update 
consists of a random jump in parameter space followed by a simulation of a model out¬ 
put and a Metropolis acceptance/rejection step based on a comparison of the simulated 
output to the data. The distance of a particle to the data is interpreted as an energy and 
the algorithm is reducing the associated temperature of the ensemble such that entropy 
production is minimized. If this simulated annealing is not too fast compared to the 
mixing speed in parameter space, the parameter marginal of the ensemble approaches the 
Bayesian posterior distribution. Annealing is adaptive and depends on certain extensive 
thermodynamic quantities that can easily be measured throughout run-time. In the gen¬ 
eral case, we propose annealing with a constant entropy production rate, which is optimal 
as long as annealing is not too fast. For the practically relevant special case of no prior 
knowledge, we derive an optimal fast annealing schedule with a non-constant entropy pro¬ 
duction rate. The algorithm does not require the calculation of the density of the model 
likelihood, which makes it interesting for Bayesian parameter inference with stochastic 
models, whose likelihood functions are typically very high dimensional integrals. 


1 Introduction 


While the amount of data measured in various complex systems keeps growing we are in 
need of algorithms that allow us to extract useful information from this data. If we manage 
to set up a model, of which the data appears to be a realization, this information can 
conveniently be expressed as a distribution of model parameters that is ” compatible” with 
the data. If the model is to be used to make reliable predictions, it is important for it to 
faithfully reproduce the relevant statistics of the data (e.g. frequency of extreme events), 
which naturally leads us to stochastic models with non-trivial output distributions. 

Bayesian statistics is a framework in which a stochastic model is expressed through a 
conditional probability distribution {likelihood function), L(x|0), for observable outputs x, 
given model parameters 6, and knowledge (or belief) about model parameters is expressed 
through a probability distribution as well. Assume we have prior knowledge in the form 
of a probability distribution, fpri{0), and measured data, y, which is believed to be a 
realization of the model. The posterior knowledge, combining prior knowledge with the 
one acquired from data, is calculated by means of eq. 


fpost i0\y) 


L{y\0)fpri{0) 

j L{y\e')fp„{e')ci^' ■ 


( 1 ) 
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This so-called Bayes theorem is consistent with sequential learning. The posterior becomes 
the new prior when new data becomes available, and the posterior fpost{6\yi, ■ ■ ■ ,yn) does 
not depend on the order in which the data points yi through y„ are learned from. 

Since Q is usually intractable, analytically, Bayesian inference algorithms usually aim 
at generating, by means of Monte Carlo methods, a sufficiently large parameter sample 
from it. This sample expresses our posterior knowledge about model parameters. The 
maximum of the posterior constitutes our best guess about model parameters. It is a 
compromise between our prior knowledge and the best model fit of the data. But we 
are really interested in the whole posterior distribution, as it expresses the uncertainty 
about model parameters. Predictive uncertainty can conveniently be estimated through 
sampling one (or several) model outputs, for each posterior parameter sample point, from 
an appropriate likelihood function. For a recent review on Bayesian inference in physics 
see [2^ . 

In cases where the likelihood function can be evaluated in reasonable time, the Metropo¬ 
lis algorithm [5] with its various ramifications is an efficient way of generating a posterior 
parameter sample. Usually, first, a global optimizer is employed to determine the max¬ 
imum of the posterior, which serves as the starting point for the Metropolis algorithm. 
In practice, evaluating L{y\d), for tens of thousands of parameter combinations, which is 
typically required for both the optimization and the Metropolis algorithm, is often not 
feasible. If evaluating L(y|0), for given y and 0, is slow, we have to distinguish two cases, 
depending on whether simulating a model output, for given 6, is slow or fast. 

As a typical example for the first case consider a slow deterministic simulation model, 
g(@), together with a multivariate normal error with mean zero that is added to this de¬ 
terministic model output. In practice, such normal errors are often used and supposed to 
lump the effects of input, model structural and measurement uncertainties m- Simulat¬ 
ing a model output and evaluating the model likelihood for the data, for a given 0, are 
then more or less of the same cost. Both require the calculation of g{9). In addition, to 
simulate a model output, we need to draw from a normal distribution, whereas to evaluate 
the model likelihood we need to evaluate the associated normal density. The cost of these 
last steps is typically negligible compared to the calculation of g{0). If one wants to make 
a Bayesian inference with such models, one needs to think of ways how to speed up the 
calculation of g{6) [HIlIT^ IT]. 

As a typical example for the second case, consider a dynamical model that is defined 
through a stochastic differential equation (SDE), and whose output is a time-series. Sim¬ 
ulating an output with such a model is often very fast; it simply requires the generation of 
a realization of the stochastic process defined by the SDE. On the other hand, evaluating 
the likelihood, for a given measured time-series, is typically very expensive. It requires 
us to numerically solve a path-integral over all possible realizations of the SDE that are 
compatible with the measured time-series. It is this second kind of problems this paper 
is concerned with: How to do a Bayesian inference with models that allow for a quick 
realization of an output, but whose likelihood densities are expensive to evaluate. Indeed, 
a faithful account of uncertainty in models of complex systems typically does require us 
to use stochastic models whose likelihood functions are very high dimensional intractable 
integrals. 

Therefore, in recent years, there has been a great interest in so-called Approximate 
Bayes Computations (ABC), especially within the statistics community (see [7] for a recent 
review). These algorithms are based on simulating outputs from the model likelihood and 
comparing them with the data rather than evaluating the likelihood function, for the data. 
Comparison with the data requires introducing a metric on the space of outputs as well 
as a tolerance within which a simulated output is deemed compatible with the data. Eor 
a posterior that is very different from the prior it is beneficial to start with a relatively 
large tolerance and decrease it during the course of the algorithm. Thereby, an ensemble 
of particles is iteratively updated so as to converge to the exact posterior along a series of 
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approximations, associated with a decreasing series of tolerances. For continuous outputs 
the tolerance will only asymptotically approach zero and, even for moderately high output 
dimensions, we will have to live with a finite tolerance that justifies the word approximate 
in ABC. If the output dimension is large (such as a time-series) we need to use appropriate 
summary statistics. If model parameters are few, it is often possible to find few summary 
statistics that contain almost the same amount of information about the parameters as 
the original output [5]. The ABC framework is generic in the sense that it only requires 
us to simulate outputs, for given parameter sets. No other information about the model 
is required. We do not even need access to the model equations. 

Recently, a framework of ABC algorithms [5] has been introduced that is inspired 
by adaptive simulated annealing algorithms that were developed for optimization tasks 
[3 dm [HIS]. These so-called SABC algorithms (merging SA, for simulated annealing, 
with ABC), attempt to move an ensemble of particles as closely as possible to the ex¬ 
act posterior with as few computer updates as possible. Each fundamental update step 
consists of a random jump in parameter space followed by a simulation of an output and 
a Metropolis acceptance/rejection step that depends on a tolerance that is continuoulsy 
decreased during run-time. These update steps are associated with a flow of entropy from 
the system (the ensemble of particles in the product space of parameters and ouputs) to 
the environment. Part of this flow is due to the decrease of entropy in the system when 
it transforms from the prior to the posterior state and constitutes the well-invested part 
of computation. Since the process happens in finite time, inevitably, additional entropy 
is produced. This entropy production is used as a measure of the wasted computation 
and minimized, as previously suggested for adaptive simulated annealing [161 IS]- The 
algorithm adapts the tolerance (temperature) according to certain measurable extensive 
thermodynamic quantities of the system, in the easiest case simply the total distance of 
the particles to the target (measured data). We assume that the system, at any time, is 
approximately described by a Gibbs state w.r.t. these quantities. Under this so-called 
endoreversibility assumption [13] the entropy production is determined by the flux of the 
extensive thermodynamic quantities times the corresponding thermodynamic forces. Min¬ 
imization of this quantity can be carried out analytically |18| and leaves us with a family 
of algorithms that is basically parameterized by two parameters, governing the annealing 
speed and the equilibration (mixing) speed, respectively. 

The convergence of adaptive SABC algorithms hinges on the endoreversibility assump¬ 
tion being satisfied. If it is not, additional entropy is produced within the system. This 
entropy can even remain in the system indefinitely and lead to a biased convergence. Tra¬ 
ditional ABC algorithms |7| enforce convergence through importance sampling, but the 
ensuing loss of effective sample size reduces their efficiency. Furthermore, they lack an 
information-theoretic criterion for the speed with which the tolerance is decreased. In |2] 
the endoreversibility assumption is assumed to be true if mixing in parameter space is 
fast enough compared to annealing. This means that the jump width in parameter space 
has to be chosen sufficiently large, depending on the annealing speed. Since, in practice, 
we are interested in fast annealing schedules, we further elaborate on this assumption in 
this paper. We first note that, even for infinitely fast mixing, the system does not follow 
a sequence of Gibbs states, i.e., there is always additional internal entropy production, 
which is not accounted for in our optimization. This might sound counter-intuitive, at 
first, but is simply due to the fact that our Metropolis particle update rule does not 
simulate a real gas or fluid. However, for the practically relevant case of negligible prior 
information and fast annealing, we show that this additional entropy production is small 
compared to the entropy production caused by heat flow, as long as mixing in parameter 
space is fast enough. Indeed, in this case, the system approximately follows a sequence of 
Gibbs states, in the limit of infinitely fast annealing and infinitely fast mixing. 

In order to be self-sufficient, this paper repeats the main ideas of |5], but with a 
notation that complies with physics standards. For technical details as well as various ex- 
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amples that show the competitive performance of SABC algorithms the reader is referred 
to the original publication [2]. 

2 The SABC framework 

We define our system as a time-dependent probability distribution, represented 

by a moving ensemble of particles, E = {di,Xi}fLi, in the product space of model pa¬ 
rameters and associated model outputs. The energy, u(x), of a particle measures the 
distance of its output component to the data y, w.r.t. some metric on the output space. 
To simplify notation, we omit the y-dependence of u. The basic idea behind the SABC 
framework is to generate, through repeated model simulations, a canonical ensemble E 
described by the distribution 

^Te(0,x) = , (2) 

for a sufficiently low temperature T®. Then, the 0-marginal of ([^ will be a good approxi¬ 
mation of the posterior Q, i.e., 

fpost{0\y) ~ y 7rT«(0,x)dx. (3) 

If the output space is M", the energy m(x) might read as 

1 ” 

u(x) = p(x, y) = - ^ Ixi - 1“ , (4) 

a ^ 


for some constant a > 0. 

It is easy to write down rules for particle updates that do not require us to calculate 
the density of the likelihood, and which have (|^ as equilibrium distribution (see eqs. Q 
or (25) below). For posteriors that are very different from the prior, it seems reasonable, 
in the spirit of simulated annealing, to make the temperature time-dependent, T® = T^{t), 
and reduce it during the course of the algorithm. The crucial question is then how fast 
we should reduce this temperature in order to have a fast convergence to the correct 
result. In [2] a convergence proof has been given, for the case when annealing obeys 
T®(t) > const and the metric (|^ is used. Compare this result with the logarithmic 
annealing that is required, for optimizers that are based on simulated annealing to 
converge. However, these results are of little practical value. What we are interested in 
are annealing schedules that take us as close as possible to the correct result after a finite 
number of particle updates. 

In practice, we often lack prior knowledge about model parameters, and the posterior 
is mainly data driven. We derive an efficient SABC algorithm, for this special case, in the 
next subsection and come back to the general case at the end of this section. 


2.1 The case of negligible prior knowledge 

Consider the case of a prior that varies much less than the likelihood function and set 


/p„(0) = const . (5) 

A simple transition rate, qT’‘{iB,x) (0',x')), that has ([^ as equilibrium distribution, 
i.e, satisfies the detailed balance condition 

7rTe(0,x)qT=((0,x) ^ (0',x')) = 7rTe(0',x')gr=((0',x') (0,x)) (6) 
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is given by eq. 


(ZT»((0,x) ^ (0',x')) = , (7) 

where k{6,6') denotes some symmetric transition rate in parameter space. According 
to Q, a particle update consists in a random jump in parameter space, followed by 
the simulation of an output from the model and a Metropolis acceptance/rejection step 
that depends solely on the change of the particle’s distance to the data. We allow the 
environmental temperature to be time-dependent, T® = T'^{t). For a sufficiently large 
number of particles, the dynamics of the ensemble is then approximately described by the 
master equation 

= Jt)qT-(t) (z' ^ z) - ^(z, t)gT-(t) (z z'))dz', (8) 

where we have used the shorthand z = (0,x)^. In ([^, time is measured in units of single 
particle updates. It is clear that for sufficiently slow annealing and possibly after an initial 
burn-in period, the system will be approximately described by the equilibrium distribution 
([^, with r® replaced by T®(t). If the annealing is too fast, however, convergence might 
again become slow, or the system might converge to a biased result. 

It is intuitively clear that slow or even biased convergence can occur if the relaxational 
processes within the system (controlled by the jump distribution k{6,6')) are slow com¬ 
pared to the annealing speed. For sufficiently fast mixing, on the other hand, we work 
under the endoreversibility assumption, which states that, at any time, the system is 
approximately described by an equilibrium distribution ii), 

/x(z,t) « 7rT(t)(z), (9) 


but with a system temperature Tit) that is higher than the environmental temperature 
T®(t) used for the particle updates. We will take this assumption for granted, for the time 
being, and confirm its validity at the end of this subsection. 

To find an efficient annealing scheme, we need an information-theoretic framework. 
Following [3], particle updates are associated with a flow of information from the en¬ 
vironment into the system, or, equivalently, as an entropy flow from the system to the 
environment. Part of this entropy stems from the reduction of the system entropy when it 
transforms from the prior to the posterior. This part is path-independent and constitutes 
the well invested part of the computation. The rest is the inevitable entropy production, 
which is computational waste we intend to minimize. According to entropy produc¬ 
tion can be understood as loss of information due to the rejections encountered during 
annealing relative to the inverse (heating) process. It is thus a measure for the irre¬ 
versibility of the computation. However, there is also a ’’reversible computational waste”: 
Running the algorithm at equilibrium does not produce any entropy. Nevertheless it is to 
be avoided as it does not lead to a reduction of the system entropy either. This is due to 
the fact that we disregard the information stored in the history of each particle and only 
consider the current state. Minimizing entropy production under the endoreversibility 
assumption will leave us with a family of algorithms that is essentially parameterized by 
two parameters governing, respectively, the mixing in parameter space and the annealing 
speed. These parameters will have to be tuned in such a way that (i) the endoreversibility 
assumption is justified and (ii) there is not too much reversible computational waste. 

Under the endoreversibility assumption © the entropy production rate, &{t), is given 

by [g 


&{t) = Uit)F(t), 


( 10 ) 
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where U{t) denotes the system’s energjQ 


U{t) = N 


N 

u(x)^(z, t)dz « ^ M(Xi) , 

i=l 


( 11 ) 


and F(t) the associated thermodynamic force 


F{t) = 


1 


T{t) T«(<) ■ 


( 12 ) 


Under fixed initial and final energies, an easy exercise in variational calculus m reveals 
the necessary and sufficient condition for minimal total entropy production to be given 
by eq. 

OF ■ 

— -U^ = V = const . (13) 

dU 


In order to derive an adaptive annealing schedule from (131 we need to know both the 


functional dependence between U and F and the functional dependence between U and 
T. 

In standard SA algorithms that are used as global optimizers, all the information is in 
the metric. In our case, the metric is just an auxiliary construct and all the information 
lies in the likelihood function. We are thus allowed to redefine the metric in order to 
design a more efficient algorithm. If p(x, y) is some user-defined metric (such as the one 
defined by the right eq. of Q), we replace the left eq. of @ by eq. 


u(x) := / L{x.'\6)fpri{6)d0dx.'. 

•^p(x':y)<p(x.y) 


(14) 


This new energy has a uniform prior density on the interval [0,1], fpriiu) = x([0,1]). 
Thus, through a redefinition of the user-defined metric p to the one defined in (14), we 


achieve a specific heat capacity that is approximately unity. Indeed, w.r.t. this new metric, 
we see that, neglecting boundary terms of order 


U{t) « NT{t). 


(15) 


The relationship between this new metric and the user-defined metric p can be established, 
approximately, during the initialization of the algorithm, when the initial ensemble is 
generated [2]. 


Using the new metric (14) we also find that, up to quadratic order in T and T® [2], 
U{t) « U{T, T") « -7(T2 - {T^f ), (16) 


with 


7= (^1 L{y\e)fp„[0)d0^ J k{0,0')Liy\0)fp„i0)Liy\0')fpr^{0')d0d0'. (17) 


Using (15) and (16), criterion now reads as 

{U{t)/N -T^{tYf V 
2r«(<)3 “ 7 


From (15), (16) and (18) we derive the asymptotics of the annealing as [2] 

T^t) - ^ 


(18) 


(19) 


^In [2], the average particle distance was used. Here, we use the total distance because it is extensive in N. 
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We employ (18) adaptively, that is, we approximate U(t)/N with the ensemble mean of the 
distances to the target, measured w.r.t. the new metric ( [l4| and determine T®(t) solving 
the quartic equation (18). The parameter 7 can either be determined, approximately, at 


the beginning of the algorithm, or we may treat u /7 as our tuning parameter that governs 
the annealing speed. 

The second tuning parameter of the algorithm governs the mixing speed. We may 
simply choose, for the jump distribution k{d,d')^ a normal distribution whose covariance 
matrix is given by 

/3E(<)+ str(S(t))l, (20) 

where /3 is our tuning parameter and I](<) an empirical covariance matrix that may be 
estimated from the 6 components of the ensemble E (either at the beginning only or 
throughout run-time, hence the time dependence). The empirical covariance of the en¬ 


semble might degenerate. Therefore, we add the second term in (20), where s is some 
small constant, to make sure that the whole parameter space is explored. Notice that 
our previous derivation assumed a time-independent jump distribution. In certain cases, 
however, it might be benehcial to adapt it to the empirical covariance of the ensemble 
throughout run-time [ 2 ]. 

Before going to the general case we show that, in the limit of infinitely fast mixing 
and infinitely fast annealing assumption © is consistent with eqs. 0 and (|^. Infinitely 
fast mixing means that instead of jumping in parameter space we simply draw from the 
prior, and, since the prior is assumed to be flat, this is achieved by setting k{0,0') = 1. In 
this case only the marginal density of the energy u is affected by particle updates. Under 
the endoreversibility assumption and w.r.t. the new metric (|I4[), this density reads 


/i(u;t) = 


m 




( 21 ) 


Plugging (21) into the master equation (1^ yields, up to terms of order exp[—1/T(t)], 


Kt;u) = J ^min ^l,exp 


exp 


T^{t)\J T{t) 


min 1 , exp 


u — u 


exp 


-u/T{t) r 

'wri. 


exp 


T{t)_ 

( 1 1 


exp 


— {u' — u) 
{u' — u) 

~YW~ 


\T{t) T<^it) 
{u' — u) 


,-u/T{t) r pu 


T{t) 


exp 


—u 


exp 

1 


T®(f) 
1 


r ( 


L 

[ ml 


r«(t) T{t) 
u' 


du' 

— 1 ) du' 
du' 

du' — u 


— exp 


fl-exp 

T{t) \ ^ ^ ’ T{t) - T<^(t) \ ^ 

On the other hand, the temporal derivative of ([2T|) reads 


du' 

T(t)-T^{t) 




( 22 ) 


(23) 


Eqs. (22) and (23) are consistent in the limit of infinitely fast annealing, T®(f) = 0, upon 
setting T{t) = i.e., for T{t) ^ 1/t. This calculation also shows that, if annealing 
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has a finite speed, there is additional internal entropy production due to a violation of 
the endoreversibility assumption. However, unless annealing is extremely slow, we expect 
(10) to be the dominant source of entropy production. Notice that, as long as mixing 
in parameter space is sufficiently fast and the final temperature is sufficiently low, the 
^-distribution represented by our final ensemble will be a faithful representation of the 
posterior. A violation of the endoreversibility assumption only means that we might not 
have found the most efficient annealing schedule. 


2.2 The general case 

If the prior carries relevant information, we must make sure that it is appropriately rep¬ 
resented in the posterior. If we expect to learn little, i.e., if we expect the posterior not to 
differ too much from the prior, we can simply employ a rejection-ABC [ini m]. That is, 
we iteratively simulate from the joint prior fpri{9)L{Ti\6) and accept only those 0 whose 
corresponding x are close enough to the data y (w.r.t. a fixed and small tolerance). This 
algorithm becomes inefficient if the posterior differs a lot form the prior, in which case 
we better employ an algorithm that is based on random jumps in parameter space. In 
that case, we must make sure that prior and likelihood decide on equal footing whether a 
move is accepted or not. 

We address this problem introducing a second energy variable, 


W2(0) =-ln(/pri(0)) , (24) 

and replacing transition rate §by[3 

rn \\ i/a/Q\T/ \ Wl (x) - Ui (x') U 2 { 6 ) - U 2 { 6 ')] \ 

gT«»((0 ,x ) ^ (6>,x)) = fc(6> ,6>)L(x|0)mm ( l,exp- — - — - j, 

(25) 

where mi(x) denotes some user-defined distance to the measured data, such as the one 
given in eq. (|^. Transition (25) satisfies a detailed balance condition analogous to (|^, 
but for the equilibrium distribution 


with 


7rTe(6>,x) = Z-i(T®)L(x|6>)e-“b’^)/Tf-«2(0)/T| ^ 


Z(T'=) = J L(x|0)e-''bx)/Tf-«2(e)/T|^g,^^^ 


(26) 


(27) 


Analogously to Sect. |2.1[ for sufficiently fast mixing, we make the endoreversibility as¬ 
sumption, 

/r(z,t) « 7rT(z,t), (28) 

and expect the dominant source of entropy production to be given by eq. 

&{t) = , (29) 

with the extensities 


Ui{t) :=N ui(x)/r(0,X, t)d0dx « ^ui(xi), 

i=i 

N 

1 / 2 ( 1 ) := N / U2{0)n{0,:x.,t)d0dxKi'^u2{Ot), 


(30) 

(31) 


^We use a notation that is somewhat different from [2], in order to comply with physics’ notation. 
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and the thermodynamic forces 




( 32 ) 


The two-dimensional problem does not lend itself to a simple redefinition of the metric, as 
in the previous section. Therefore, we can only uphold the endoreversibility assumption 


(28), for sufficiently small thermodynamic forces. 


A necessary criterion for minimal entropy production, for fixed initial and final values 
of U, is given by m 

■ j, OF ■ 

U —= V = const . 

au 


(33) 


To derive an annealing schedule from (33) we need to know the functional dependencies 
between U and T and between U and F, respectively. For sufficiently small thermody¬ 
namic forces, we make the linearity assumption 


U(t) = L(U(t))F(t), 


(34) 


where L denotes the Onsa^er matrix (which is symmetric due to detailed balance). Under 
assumption (341, condition (33) leads to a constant entropy production rate m 


&{t) = tS^R{V)iJ = v, 


(35) 


where i?(U) := L ^(U) defines a metric on the (Ui, C/ 2 )-plane. Due to the Cauchy- 
Schwartz inequality 




1C 


tf ti 


(36) 


where 1C is the length of the process-path in the (Ui, C/ 2 )-plane, measured with the metric 


i?(U). The lower bound of (36) is assumed if the integrand is constant, i.e., if the entropy 
production rate is constant. Thus, under linearity assumption (34), a necessary and 


sufhcient condition for minimal entropy production is to travel along a path of minimal 
length (w.r.t. metric R) in such a way that the entropy production rate is constant [15) . 

To derive an adaptive annealing schedule from this we continuously need to estimate 
the Onsager matrix T(U), during run-time. To this end, we derive from ([^, (25), (28), 
(30) and (31) the fluctuation-dissipation theorem 


Lij(U) = Z i(T) J (u,(z) - uflz')){uj{z) - Uj{z')) 

X fc(0,0')L(x|0)L(x'|0') exp[-Mi(x)/ri - U 2 { 6 )/T 2 ] 

X X(('Wl(x) - Ui(x ))/Ti -P {u2{6) - U2{B'))IT2) 

X d'x.dyCdOdO' . (37) 


The r.h.s. of (37) can be estimated, during run-time, in different ways: It can be estimated, 
in a rather straightforward way, using both the ensemble E and the prior sample that is 
generated, as a side product, when the initial ensemble is generated [2]. In a later stage 
of the algorithm, we can also estimate it by means of a matrix of attempted moves that 
has been populated during run-time nil]. 

Since, at the beginning of the algorithm, neither is the target value for t/ 2 , at Ti = 0 
and T 2 = 1, known exactly nor is the metric 7?(U) = L“^(U) known globally. Thus, it 
appears difficult to come up with an optimal path in the (Ui, [/ 2 )-plane up-front. There¬ 
fore, we force the process to be on a path such that T 2 remains close to unity, at all 
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times. Then, the prior always has the correct weight. Practically, this can be achieved by 
applying a counter force, setting 


1 

-^2 




(38) 


where a is some positive tuning parameter. 

The last ingredient for the algorithm is a run-time estimate of the system’s intensities 
T, based on the measured extensities U. To this end we derive from (28), (30) and (31) 
the fluctuation-dissipation theorem 


dU y) 2 Var(Mi) 

aT Cov(mi,M2 ) 


^ Cov(ui, W 2 ) 
^ Var(it2) 


(39) 


The r.h.s. of (39) can easily be estimated with the empirical covariance of the ensemble 
E and used to estimate the change in intensities, AT, following a measured change of 
extensities, AU, by means of the approximation 


AT 



(40) 


Of course, occasionally, the estimate of T needs to be corrected using the endoreversibility 
assumption ([ 2 § [ 2 ]. 


3 Conclusions 

We have presented a scheme of algorithms that allow us to extract probabilistic informa¬ 
tion about model parameters from data. The scheme is generic in the sense that we only 
need access to the model in the form of a simulator, i.e., we do not need access to the 
density of the model’s likelihood function. 

Our scheme is inspired by a body of literature on adaptive simulated annealing pm 
iiiiiP- To design an efficient annealing schedule we employ the principle of minimal 
entropy production [HP- Our algorithms are adaptive as they determine the state of 
the system from measured thermodynamic extensive quantities. The underlying endore¬ 
versibility assumption (i.e. the assumption that the state of the system is approximately 
described by a Gibbs state w.r.t. the measured extensities) is approximately satisfied as 
long as the mixing within the system (determined by the jump width) is sufficiently fast 
and the annealing speed sufficiently slow. Under the endoreversibility assumption, entropy 
production is given by the flow of extensities from the system to the environment (or vice 
versa). Minimization of this entropy production leads to a family of adaptive annealing 
schedules that are basically parameterized by two parameters, determining, respectively, 
the mixing and the annealing speed. 

In contrast to optimization the complexity of our task is in the likelihood function, 
from which we only have to draw model outputs, and not in the metric that is used to 
accept or reject a suggested move in parameter space. In the practically relevant case of 
negligible prior information, a simple redefinition of the metric allows us to go beyond the 
linearity assumption that underlies many optimization algorithms and derive an optimal 
algorithm for fast annealing, which is not based on a constant entropy production rate. 
Furthermore, in this special case, the bias in the final result will be small as long as mixing 
in parameter space is sufficiently fast and the final temperature is sufficiently low. 

In the general case of non-negligible prior knowledge we need to take care that the 
prior is properly represented in the posterior. We do this by introducing a second en¬ 
ergy, measuring prior energy of a particle, and an associated temperature (control 
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parameter). For that general case, we can only derive an optimal annealing schedule that 
is sufficiently slow to justify both the endoreversibility assumption as well as a linearity 
assumption between thermodynamic forces and fluxes. Under this assumption, optimality 
leads to a constant entropy production rate [15] • While the adaptation of r| aims at a 
proper representation of the prior, a violation of the endoreversibility assumption might 
lead to a bias nonetheless. If the prior is very strong, in order to avoid a bias, annealing 
might have to be so slow that the algorithm becomes inferior to a simple rejection ABC 
algorithm mnii. This deficiency is inherent to all ABC algorithms that are based on a 
decreasing tolerance. 

As all ABC algorithms are based on drawing outputs from the likelihood function 
their efficiency drops drastically with the dimension of the output space. If the dimension 
of the output space is large, but the dimension of the parameter space small it is often 
possible to reduce the output dimension to the order of the dimension of the parameter 
space without loosing too much information about the parameters. A semi-automatic 
method to produce these so-called summary statistics has been proposed in l^. 

An implementation in the free statistics software of the methods described in this 
paper, including semi-automatic summary statistics, is available for free download from 
the author’s githulj^and as part of the R-package Easy-ABCj^ Furthermore, pseudocodes 
of the suggested algorithms along with several case studies can be found in the original 
publication [5|. 
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